Abstract

We propose an improvement of the known DFT-based
indexing technique for fast retrieval of similar
time sequences. We use the last few Fourier coeffi- 
cients in the distance computation without storing 
them in the index since every coefficient at the end 
is the complex conjugate of a coefficient at the be- 
ginning and as strong as its counterpart. We show 
analytically that this observation can accelerate the 
search time of the index by more than a factor of 
two. This result was confirmed by our experiments, 
which were carried out on real stock prices and syn- 
thetic data. 

Keywords: similarity retrieval, time series indexing

1 Introduction 

Time sequences constitute a large amount of data 
stored in computers. Examples include stock prices, 
exchange rates, weather data and biomedical mea- 
surements. We are often interested in similarity 
queries on time-series data [APWZ95, ALSS95]. For
example, we may want to find stocks that behave 
in approximately the same way; or years when the 
temperature patterns in two regions of the world 
were similar. 

There have been several efforts to develop access
methods for efficient retrieval of similar time sequences
[AFS93, FRM94, RM97, YJF98]. Agrawal et al. [AFS93] propose
an efficient index structure to retrieve similar time
sequences stored in a database. They map time sequences
into the frequency domain using the Discrete Fourier
Transform (DFT) and keep the first few coefficients in
the index. Two sequences are considered similar if
their Euclidean distance is less than a user-defined
threshold.

In this paper, we propose using the last few
Fourier coefficients of a time sequence in the distance
computation. The main observation is that every
coefficient at the end is the complex conjugate of a
coefficient at the beginning and as strong as its
counterpart. This observation reduces the search time
of the index by more than 50 percent in most cases.

The rest of the paper is organized as follows. In
the next section we review some background material
on related work and on the discrete Fourier
transform. Our proposal on the efficient use of
DFT in retrieving similar time sequences is discussed
in Section 3. In the same section, we present
analytical results on the search time improvements
of our proposed method. Section 4 discusses the
performance results obtained from experiments on
real and synthetic data. Section 5 is the conclusion.

2 Background 

In this section, we briefly review background mate- 
rial on past related work and on the discrete Fourier 
transform. 

2.1 Related Work 

There has been some follow-up work on the indexing
technique proposed by Agrawal et al. [AFS93].
In an earlier work [RM97], we use this indexing
method and propose techniques for retrieving similar
time sequences whose differences can be removed
by a linear transformation such as moving average,
time scaling and inverting. In another work [Raf98],
we generalize this framework to multiple
transformations. More follow-up work includes the
work of Faloutsos et al. [FRM94] on subsequence
matching and that of Goldin et al. [GK95]
on normalizing sequences before storing them in
the index.

In this paper, we use the indexing technique
proposed by Agrawal et al. [AFS93], but in addition
to the first few coefficients we also take the last 
few coefficients into account. Both our analytical 
results and our experiments show that this obser- 
vation accelerates the retrieval speed of the index 
by more than a factor of 2. All follow-up works 
described earlier benefit from this performance im- 
provement. 

There are other related works on time series
data. A domain-independent framework for posing
similarity queries on a database is developed by
Jagadish et al. [JMM95]. The framework has three
components: a pattern language, a transformation
rule language, and a query language. The framework
can be tuned to the needs of the time sequences
domain. Yi et al. [YJF98] use time warping as
a distance function and present algorithms for
retrieving similar time sequences under this function.
Agrawal et al. [APWZ95] describe a pattern language
called SDL to encode queries about "shapes"
found in time sequences. A query language for time
series data in the stock market domain is developed
by Roth [Rot93]. The language is built on top
of CORAL [RSS92], and every query is translated
into a sequence of CORAL rules. Seshadri et al.
[SLR94] develop a data model and a query language
for sequences in general but do not mention
similarity matching as a query language operator.

2.2 Discrete Fourier Transform 

Let a time sequence be a finite duration signal
$x = [x_t]$ for $t = 0, 1, \ldots, n-1$. The DFT of x,
denoted by X, is given by

$$X_f = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \, e^{-j 2\pi t f / n}, \qquad f = 0, 1, \ldots, n-1 \qquad (1)$$



where $j = \sqrt{-1}$ is the imaginary unit. Throughout
this paper, unless it is stated otherwise, we
use small letters for sequences in the time domain
and capital letters for sequences in the frequency
domain. The energy of signal x is given by the
expression

$$E(x) = \sum_{t=0}^{n-1} |x_t|^2 \qquad (2)$$



A fundamental observation that guarantees the
correctness of the indexing method for time series
data is Parseval's rule [OS75], which states that for a
given signal x its energy remains the same after
DFT, i.e.

$$E(x) = E(X) \qquad (3)$$

where X is the DFT of x. Using Parseval's rule
and the linearity property of DFT (for example,
see Oppenheim and Schafer [OS75] for details), it
is easy to show that the Euclidean distance between
two signals in the time domain is the same as their
distance in the frequency domain.

$$D^2(x, y) = E(x - y) = E(X - Y) = D^2(X, Y) \qquad (4)$$
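Equations (1) to (4) can be checked numerically. The following is a minimal sketch, assuming the unitary 1/sqrt(n) scaling of equation (1); the function names and the two sample sequences are illustrative and not part of the original method.

```python
import cmath
import math

def dft(x):
    """Unitary DFT: X_f = (1/sqrt(n)) * sum_t x_t * exp(-j*2*pi*t*f/n)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * f / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]

def energy(s):
    """Energy of a real or complex sequence: sum of squared magnitudes."""
    return sum(abs(v) ** 2 for v in s)

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return energy([u - v for u, v in zip(a, b)])

# Two arbitrary sample sequences (illustrative data only).
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
y = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0]
X, Y = dft(x), dft(y)

# Both gaps should vanish up to floating-point rounding.
parseval_gap = abs(energy(x) - energy(X))            # equation (3)
distance_gap = abs(sq_dist(x, y) - sq_dist(X, Y))    # equation (4)
```

With a different scaling convention for the DFT (for example, no 1/sqrt(n) factor), the two energies would differ by a constant factor, which is why the scaling is stated as an assumption here.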

3 Storage and Retrieval of Similar 
Time Sequences 

Given a set of time series data, we can construct an
index [AFS93] as follows: find the DFT of each
sequence and keep the first few DFT coefficients as
the sequence features. Let's assume that we keep
the first k coefficients. Since all DFT coefficients
except the first one are complex numbers, keeping
the first k DFT coefficients maps every time series
into a point in a (2k-1)-dimensional space. These
points can be organized in a multidimensional index
such as the R-tree family [Gut84, BKSS90] or grid
files [NHS84]. Keeping only the first k Fourier
coefficients in the index does not affect the correctness
because the Euclidean distance between any two
points in the feature space is less than or equal
to their real distance due to Parseval's rule and
the monotonic property of the Euclidean distance.
Thus, the index always returns a superset of the
answer set. However, the performance of the index
mainly depends on the energy concentration of
sequences within the first k Fourier coefficients. It
turns out that a large class of real world sequences
concentrate the energy within the first few
coefficients, i.e. they have a skewed energy spectrum of
the form $O(F^{-2b})$ for b > 0.5 where F denotes the
frequency. For example, classical music and jazz
fall in the class of pink noise whose energy
spectrum is $O(F^{-1})$ [WS90, Sch91], stock prices and
exchange rates fall in the class of brown noise whose
energy spectrum is $O(F^{-2})$ [Man83, Cha84], and
the water level of rivers falls in the class of black
noise for which b > 1 [Man83, Sch91].
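The lower-bounding argument (distance in the feature space never exceeds the real distance) can be illustrated with a short sketch. It assumes the unitary DFT of equation (1); the helper names, the seed and the random test sequences are hypothetical choices made for this example.

```python
import cmath
import math
import random

def dft(x):
    """Unitary DFT as in equation (1)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * f / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]

def features(x, k):
    """First k DFT coefficients, i.e. the features kept in the index."""
    return dft(x)[:k]

def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

rng = random.Random(0)  # fixed seed, used only for repeatability
x = [rng.uniform(-1.0, 1.0) for _ in range(16)]
y = [rng.uniform(-1.0, 1.0) for _ in range(16)]

true_distance = dist(x, y)
# Dropping coefficients removes non-negative terms from the sum,
# so the feature distance can only underestimate the real one.
feature_distance = dist(features(x, 3), features(y, 3))
```

Because the feature distance is a lower bound, any sequence within the threshold of the query also lies within the threshold in the feature space, which is the reason the index returns a superset of the answer set.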

To retrieve similar time sequences stored in the 
index we may invoke one of the following spatial 
queries: 

• Range Query: Given a query point Q and 
a threshold e, find all points X such that the 
Euclidean distance D(X,Q) < e. 

• Nearest Neighbor Query: Given a query
point Q, find all points X such that the Euclidean
distance D(X, Q) is the minimum. Similarly,
a k-nearest neighbor query asks for the
k closest points of a given point.

• All-Pair Query: Given two multidimensional
point sets $S_1, S_2 \subset S$ and a threshold e, find all
pairs of points $(X, Y) \in S_1 \times S_2$ such that the
Euclidean distance D(X, Y) < e.

Suppose we want to answer a range query using the 
index, i.e., to find all sequences X that are within 
distance e of a query sequence Q, or equivalently 
D(X, Q) < e. A common approach to answer this 
query is to build a multidimensional rectangle of 
side 2e (or a multidimensional circle of radius e) 
around Q and check for overlap between the query 
rectangle (circle) and every rectangle in the index. 
That is, instead of checking $D^2(X, Q) < e^2$, we
check $|X_f - Q_f|^2 < e^2$ for $f = 0, \ldots, k-1$. The
latter is a necessary (but not sufficient) condition 
for the former. 

The size of the query rectangle has a strong
effect on the number of directory nodes accessed
during the search process and on the number of
candidates, which include all qualifying data items plus
some false positives (data items whose full database
records do not intersect the query region). Our goal
here is to reduce the size of the query region, using
the inherent properties of DFT, without sacrificing
the correctness.

3.1 Our Proposal 

The following lemma is central to our proposal. 

Lemma 1 The DFT coefficients of a real-valued
sequence of duration n satisfy $X_{n-f} = X_f^*$ for
$f = 1, \ldots, n-1$, where the asterisk denotes complex
conjugation^1.

Proof: See Oppenheim and Schafer [OS75, page 25].

This means the Fourier transform of every real-valued
sequence is symmetric with respect to its
middle. A simple implication of this lemma is
$|X_{n-f}| = |X_f|$, i.e. every amplitude at the
beginning except the first one appears at the end.
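The conjugate symmetry stated in Lemma 1 is easy to verify on a small example. This is a sketch under the assumption of the unitary DFT of equation (1); the sample sequence is arbitrary and has odd length on purpose.

```python
import cmath
import math

def dft(x):
    """Unitary DFT as in equation (1)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * f / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]  # arbitrary real-valued sequence
X = dft(x)
n = len(x)

# Largest deviation from X_{n-f} = conj(X_f) over f = 1, ..., n-1.
symmetry_gap = max(abs(X[n - f] - X[f].conjugate()) for f in range(1, n))

# Largest deviation between the amplitudes |X_{n-f}| and |X_f|.
amplitude_gap = max(abs(abs(X[n - f]) - abs(X[f])) for f in range(1, n))
```

Both gaps are zero up to rounding for any real-valued input; for a complex-valued input the symmetry does not hold in general.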

Observation 1 In the class of (real-valued) time 
sequences that have an energy spectrum of the form 
$O(F^{-2b})$ for b > 0.5, the DFT coefficients are not
only strong at the beginning but also strong at the 
end. 

This means if we do our distance computations 
based on only the first k Fourier coefficients, we 
will miss all the information carried by the last k 
Fourier coefficients which are as important as the 
former. However, the next observation shows that 
the first k Fourier coefficients are the only features 
that we need to store in the index. 

Observation 2 The first $\lceil (n+1)/2 \rceil$ DFT coefficients
of every (real-valued) time sequence contain
the whole information about the sequence. 

The point left to describe now is how we can 
take advantage of the last k Fourier coefficients 
without storing them in the index. We can write 
the Euclidean distance between two time sequences 
x and q, using equations (2) and (4), as follows:

$$D^2(x, q) = D^2(X, Q) = \sum_{f=0}^{n-1} |X_f - Q_f|^2 \qquad (5)$$



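where X and Q are respectively DFTs of x and q.
Since $|X_{n-f}| = |X_f|$ and $|Q_{n-f}| = |Q_f|$ for
$f = 1, \ldots, n-1$ and, more precisely, Lemma 1 gives
$X_{n-f} - Q_{n-f} = (X_f - Q_f)^*$, we have
$|X_{n-f} - Q_{n-f}| = |X_f - Q_f|$ and can write
$D^2(X, Q)$ as follows:

$$D^2(X, Q) = |X_0 - Q_0|^2 + \begin{cases} \sum_{f=1}^{n/2-1} 2|X_f - Q_f|^2 + |X_{n/2} - Q_{n/2}|^2 & \text{for even } n \\ \sum_{f=1}^{(n-1)/2} 2|X_f - Q_f|^2 & \text{for odd } n \end{cases} \qquad (6)$$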
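Equation (6) says the full distance can be computed from roughly half of the coefficients. A minimal sketch, assuming real-valued inputs and the unitary DFT of equation (1); `half_spectrum_sq_dist` is a hypothetical name introduced for this example.

```python
import cmath
import math

def dft(x):
    """Unitary DFT as in equation (1)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * f / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]

def full_sq_dist(X, Q):
    """Equation (5): sum over all n coefficients."""
    return sum(abs(a - b) ** 2 for a, b in zip(X, Q))

def half_spectrum_sq_dist(X, Q):
    """Equation (6): uses only coefficients 0 .. floor(n/2)."""
    n = len(X)
    total = abs(X[0] - Q[0]) ** 2
    if n % 2 == 0:
        # f = 1 .. n/2 - 1 counted twice, plus the middle coefficient once.
        total += sum(2 * abs(X[f] - Q[f]) ** 2 for f in range(1, n // 2))
        total += abs(X[n // 2] - Q[n // 2]) ** 2
    else:
        # f = 1 .. (n - 1)/2 counted twice.
        total += sum(2 * abs(X[f] - Q[f]) ** 2 for f in range(1, (n - 1) // 2 + 1))
    return total

even_x, even_q = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0], [2.0, 3.0, 5.0, 7.0, 1.0, 1.0]
odd_x, odd_q = even_x + [3.0], even_q + [9.0]

even_gap = abs(full_sq_dist(dft(even_x), dft(even_q))
               - half_spectrum_sq_dist(dft(even_x), dft(even_q)))
odd_gap = abs(full_sq_dist(dft(odd_x), dft(odd_q))
              - half_spectrum_sq_dist(dft(odd_x), dft(odd_q)))
```

The even and odd cases are tested separately because the middle coefficient $X_{n/2}$ exists only for even n and is its own mirror image.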



A necessary condition for the left side to be less
than $e^2$ is that every term on the right side
be less than $e^2$. For the time being and just for the
purpose of presentation, we assume time sequences
are normalized^2 before being stored in the index.
In general, time sequences may be normalized for
efficiency reasons [GK95] or for other useful
properties [Raf98]. Since the first Fourier coefficient
is zero for normalized sequences, there is no
need to store it in the index. In addition, since k is
usually a small number, much smaller than n, we
can assume that the (n/2)th coefficient is also not
stored in the index. Now the condition left to be
checked on the index is



$$2|X_f - Q_f|^2 < e^2 \qquad \text{for } f = 1, \ldots, k$$

or, equivalently

$$|X_f - Q_f| < \frac{e}{\sqrt{2}} \qquad \text{for } f = 1, \ldots, k$$

1 $(a + bj)^* = (a - bj)$

A common approach to check this condition is
to build a search rectangle of side $2e/\sqrt{2} = \sqrt{2}e$ (or
a circle of diameter $\sqrt{2}e$) around Q and check for
an overlap between this rectangle (circle) and every
rectangle in the index. The search rectangle
is still guaranteed to include all points within the
Euclidean distance e from Q, but there is a major drop
in the number of false positives. The effect of
reducing the size of the search rectangle on the search
time of a range query is analytically discussed in
the next section.
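The effect of the tighter bound on the candidate set can be sketched without an R-tree by applying the per-coefficient condition directly to every stored sequence. This is only an illustration: the per-coefficient magnitude test stands in for the rectangle overlap test of a real index, and the data, seed, step range and threshold choice are assumptions made for the example.

```python
import cmath
import math
import random

def dft(x):
    """Unitary DFT as in equation (1)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * f / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]

def random_walk(n, rng):
    """Simple random walk with unit-range steps (illustrative data)."""
    value, walk = 0.0, []
    for _ in range(n):
        value += rng.uniform(-1.0, 1.0)
        walk.append(value)
    return walk

def passes(X, Q, k, bound):
    """Filter on coefficients 1..k: |X_f - Q_f| < bound for every f."""
    return all(abs(X[f] - Q[f]) < bound for f in range(1, k + 1))

rng = random.Random(1)  # fixed seed, used only for repeatability
data = [dft(random_walk(32, rng)) for _ in range(200)]
Q, k = data[0], 2

true_dist = [math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(X, Q))) for X in data]
e = sorted(true_dist)[len(true_dist) // 4]  # threshold picked from the data

answers = {i for i, d in enumerate(true_dist) if d < e}
loose = {i for i, X in enumerate(data) if passes(X, Q, k, e)}
tight = {i for i, X in enumerate(data) if passes(X, Q, k, e / math.sqrt(2))}
false_positives = (len(loose - answers), len(tight - answers))
```

The containment `answers <= tight <= loose` holds by equation (6) as long as k is smaller than n/2; how many false positives are actually removed depends on the data and on the threshold.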

The symmetry property can be similarly used
to reduce the size of the search rectangle even if
sequences are not normalized. The only difference
is that one side of the search rectangle (the one
representing the first DFT coefficient^3) is 2e and
all other sides are $\sqrt{2}e$.

We can show that all-pair queries also benefit
from the symmetry property of DFT. Suppose we
want to answer an all-pair query using two R-tree
indices, i.e., to find all pairs of sequences that are
within distance e from each other. A common
approach for processing this query is to take pairs
of (minimum bounding) rectangles, one rectangle
from each index, extend the sides of one by 2e
and check for a possible overlap with the other.
However, the symmetry property implies that if
we extend every side by $\sqrt{2}e$, the result is still
guaranteed to include all qualifying pairs though
the number of false positives is reduced.

2 A sequence is in normal form if its mean is 0 and its
standard deviation is 1.

3 Note that the first DFT coefficient is a real number.

3.2 Analytical Results on the Search Time
Improvements

There are two factors that affect the search time
of a range query, if we assume the CPU time to
be negligible: one is the number of index nodes
touched by the query rectangle and the other is
the number of data points inside the search
rectangle (or candidates). Both factors can be
approximated by the area of the search rectangle, if we
assume data points are uniformly distributed over a
unit square, and the search rectangle is a rectangle
within this square^4. Thus, to compare the search
time of a rectangle of side $\sqrt{2}e$ to that of one of
side 2e, we compare their areas.

Since a search rectangle has 2k sides, the area
(or the volume) of a search rectangle of side $\sqrt{2}e$
is $(\sqrt{2}e)^{2k} = 2^k e^{2k}$. This is $1/2^k$ of the area
(or the volume) of a rectangle of side 2e which is
$(2e)^{2k} = 2^{2k} e^{2k}$. Thus under the assumptions we
have made, using a search rectangle of side $\sqrt{2}e$
instead of one of side 2e should reduce the search
time by $(1 - 1/2^k) \times 100$ percent. For example, using
a rectangle of side $\sqrt{2}e$ on an index built on the
first two non-zero DFT coefficients should reduce
the search time by 75 percent.
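The arithmetic behind the 75 percent figure can be reproduced in a few lines. This is a small sketch under the uniform-distribution assumption made above; the function names and the sample threshold are illustrative.

```python
import math

def volume(side, k):
    """Volume of a search rectangle with 2k sides of equal length."""
    return side ** (2 * k)

def reduction_percent(k):
    """Predicted search time reduction, (1 - 1/2^k) * 100."""
    return (1 - 1 / 2 ** k) * 100

e = 0.1  # arbitrary threshold; the volume ratio does not depend on it

# Ratio of the two volumes for k = 1, 2, 3 kept coefficients.
ratios = {
    k: volume(math.sqrt(2) * e, k) / volume(2 * e, k)
    for k in (1, 2, 3)
}
predicted = {k: reduction_percent(k) for k in (1, 2, 3)}
```

For k = 2 the ratio is 1/4, so the predicted reduction is 75 percent, matching the example in the text; for k = 3 it is 1/8, i.e. 87.5 percent.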

However, for the class of time sequences that
have an energy spectrum of the form $O(F^{-2b})$, the
amplitude spectrum follows $O(F^{-b})$. In particular
for b > 0, the amplitude decreases as a function of
frequency and points get denser in higher frequencies.
If we assume that the first non-zero DFT coefficient
(for every data or query sequence) is uniformly
distributed within a unit square, the ith DFT
coefficient (for $i = 1, \ldots, k$) must be distributed
uniformly within a square of side $i^{-b}$. Thus keeping
the first k Fourier coefficients maps sequences
into points which are uniformly distributed within the
rectangle $R = (\langle 0, 1 \rangle, \langle 0, 1 \rangle, \langle 0, 2^{-b} \rangle, \langle 0, 2^{-b} \rangle, \ldots, \langle 0, k^{-b} \rangle, \langle 0, k^{-b} \rangle)$.

In addition, a search rectangle built on an arbi- 
trarily chosen query point Q (inside or on R) is not 
necessarily contained fully within R. If Q happens 
to be a central point of R, the overlap between the 
two rectangles reaches its maximum. We refer to 
this query as 'the worst case query' since it requires 
the largest number of disk accesses. On the other 
hand, if Q happens to be a corner point of R, 
the overlap between the two rectangles reaches its 
minimum. We call this query 'the best case query'. 
Thus the area of the overlap between the search
rectangle and R, and as a result the search time,
depends not only on e but also on Q.

To compare the search time of a query rectangle
of side $\sqrt{2}e$ to that of one of side 2e, we can
compare their area of overlap with R. The projection
of the overlap between a search rectangle of side
2e and R to the ith DFT coefficient plane is a
square of side $\min(i^{-b}, 2e)$ for the worst case query
and a square of side $\min(i^{-b}, e)$ for the best case
query. Thus the area of the overlap between the
search rectangle and R for the worst case query
is $\prod_{i=1}^{k} (\min(i^{-b}, 2e))^2$ and that for the best case
query is $\prod_{i=1}^{k} (\min(i^{-b}, e))^2$.

4 We relax our assumptions later in this section.

[Illustration: the rectangle R, with the best case query point marked • and the worst case query point marked *]

To eliminate the effect of the size of R in our
estimates, we divide the area of the overlap by the
area of R, i.e. $\prod_{i=1}^{k} (i^{-b})^2$, to get what we call
the query selectivity. The query selectivity for the
worst case query using a search rectangle of side 2e
can be expressed as follows:

$$S(b, k, 2e) = \frac{\prod_{i=1}^{k} (\min(i^{-b}, 2e))^2}{\prod_{i=1}^{k} (i^{-b})^2} = \prod_{i=1}^{k} (\min(i^{-b}, 2e) \, i^b)^2 \qquad (7)$$



The term $\min(i^{-b}, 2e) \, i^b$ is 1 for $i^{-b} \le 2e$ (or
$i \ge (2e)^{-1/b}$), and it is $2e \, i^b$ for $i^{-b} > 2e$ (or
$i < (2e)^{-1/b}$). Thus the query selectivity can be
expressed as

$$S(b, k, 2e) = \prod_{i=1}^{\min(k, \lfloor (2e)^{-1/b} \rfloor)} (2e \, i^b)^2 \qquad (8)$$



It can be easily shown that S(b, k, e) gives the query
selectivity for the best case query using the same
search rectangle. If we employ the symmetry
property of the DFT, i.e. use a search rectangle of
side $\sqrt{2}e$, the query selectivities for the worst and
the best case queries would be $S(b, k, \sqrt{2}e)$ and
$S(b, k, e/\sqrt{2})$ respectively.
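The selectivity of equation (7) is straightforward to evaluate. The sketch below computes it for Brownian noise data (b = 1) and compares the two rectangle sizes; the function names and the sample threshold e = 0.25 are choices made for this illustration, not values taken from the figures.

```python
import math

def selectivity(b, k, side):
    """Equation (7): product over i = 1..k of (min(i^-b, side) * i^b)^2."""
    result = 1.0
    for i in range(1, k + 1):
        result *= (min(i ** -b, side) * i ** b) ** 2
    return result

def reduction(b, k, e, worst=True):
    """Fractional drop in selectivity when the side shrinks from 2e to sqrt(2)e."""
    scale = 1.0 if worst else 0.5  # best case: only half of each side overlaps R
    before = selectivity(b, k, scale * 2 * e)
    after = selectivity(b, k, scale * math.sqrt(2) * e)
    return 1 - after / before

# Brownian noise data (b = 1), two non-zero coefficients kept.
worst_drop = reduction(b=1, k=2, e=0.25)
best_drop = reduction(b=1, k=2, e=0.25, worst=False)
edge_drop = reduction(b=1, k=2, e=0.5)  # threshold at the edge of the range
```

For these inputs the worst case drop is 75 percent at e = 0.25 and 50 percent at e = 0.5, and the best case drop is 75 percent at e = 0.25, which agrees with the ranges quoted in the text for k = 2.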

Figure 1 shows the worst case query selectivity
per search rectangle and k, varying the query
threshold for Brownian noise data (b = 1). As is
shown, using the symmetry property reduces the
query selectivity by 50 to 75 percent for k = 2 and
e < 0.5. If we keep the first three non-zero DFT
coefficients (k = 3), using the symmetry property
reduces the selectivity by up to 87 percent. In
general, taking the symmetry property into account
reduces the selectivity and as a result the search
time in the worst case by 50 to $(1 - 1/2^k) \times 100$
percent for k > 2 and e < 0.5.

Figure 2 shows the best case query selectivity
per search rectangle and k, varying the query
threshold again for the Brownian noise data. As is 
shown, taking the symmetry property into account 
reduces the selectivity by at least 75 percent for 
all values of e < 0.5, if we keep only the first 
two non-zero DFT coefficients. In general, taking 



4 Experiments 

To show the performance gain of our proposed method,
we implemented it using Norbert Beckmann's
Version 2 implementation of the R*-tree [BKSS90] and
compared it to the original indexing method
proposed by Agrawal et al. [AFS93]. All our
experiments were conducted on a 168 MHz UltraSPARC
station. We ran experiments on the following two
data sets:

1. Real stock prices data obtained from the FTP 
site "ftp.ai.mit.edu/pub/stocks/results". The 
data set consisted of 1067 stocks and their 
daily closing prices. Every stock had at least 
128 days of price recordings. 

2. Random walk synthetic sequences each of the
form $x = [x_t]$ where $x_t = x_{t-1} + z_t$ and $z_t$ is
a uniformly distributed random number in the
range [-500, 500]. The data set consisted of
20,000 sequences.
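A generator for sequences of this form might look as follows. This is a sketch only: the starting level of 0, the seed and the small number of sequences are assumptions, since the text specifies only the recurrence and the range of $z_t$.

```python
import random

def random_walk(length, rng, step=500.0):
    """x_t = x_{t-1} + z_t with z_t uniform in [-step, step]."""
    value, walk = 0.0, []  # starting level 0 is an assumption
    for _ in range(length):
        value += rng.uniform(-step, step)
        walk.append(value)
    return walk

rng = random.Random(42)  # fixed seed, used only for repeatability
sequences = [random_walk(128, rng) for _ in range(5)]

# Largest absolute step between consecutive values over all sequences.
largest_step = max(
    abs(s[t] - s[t - 1]) for s in sequences for t in range(1, len(s))
)
```

Scaling this up to the 20,000 sequences used in the experiments only requires changing the count in the list comprehension.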

We first transformed every sequence to its nor- 
mal form, and then found its DFT coefficients. We 
kept the first k DFT coefficients as the sequence 
features. Since a DFT coefficient was a complex 
number, a sequence became a point in a 2k-dimensional
space. But the first DFT coefficient was always 
zero for normalized sequences, and we did not need 
to store it in the index; instead, we stored the mean 
and the standard deviation of a sequence along with 
its k — 1 DFT coefficients. In our experiments we 
used the polar representation for complex numbers. 
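The feature extraction described above can be sketched as follows. The use of the population standard deviation, the unitary DFT scaling of equation (1), and the names `normal_form` and `index_entry` are assumptions made for this illustration.

```python
import cmath
import math

def dft(x):
    """Unitary DFT as in equation (1)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * f / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]

def normal_form(x):
    """Return (mean, std, normalized sequence) with mean 0 and std 1."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)  # population std (assumption)
    return mean, std, [(v - mean) / std for v in x]

def index_entry(x, k):
    """Mean, std and coefficients 1..k-1 of the normal form, in polar form."""
    mean, std, z = normal_form(x)
    coeffs = dft(z)[1:k]  # skip coefficient 0, which is zero after normalization
    return mean, std, [cmath.polar(c) for c in coeffs]

x = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0, 21.0]  # illustrative data
mean, std, polar = index_entry(x, 3)
first_coefficient = dft(normal_form(x)[2])[0]
```

Each stored pair is (amplitude, phase angle in radians); the first coefficient of the normalized sequence is zero up to rounding because it is proportional to the mean.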

To do the performance comparison, we used both 
range and all-pair queries. For range queries, we 
ran each experiment 100 times and each time we 
chose a random query sequence from the data set 
and searched for all other sequences within distance 
e of the query sequence. We averaged the execution 
times from these runs. Our all-pair queries
were spatial self-join queries where we searched the 
data set for all sequence pairs within distance e of 
each other. 



Figure 2: Query selectivity per search rectangle 
and k varying the threshold for the best case query 
on Brownian noise data 



4.1 Varying the query threshold 

Our first experiment was on stock prices data
consisting of 1067 time sequences each of length 128.
Our aim was to make a comparison between average
case query selectivities obtained experimentally
and the extreme case query selectivities computed
analytically. We fixed the number of DFT coefficients
to 2, but we varied the query threshold from
1 * MaxAmp to 0.24 * MaxAmp where MaxAmp
was the maximum amplitude of the first non-zero
DFT coefficient over all sequences in the data set.
Under this setting, a threshold e * MaxAmp in
our experiments was equivalent to threshold e in
our analytical results. The average output size for
e = 1 * MaxAmp was 75 out of 1068 and that
for e = 0.24 * MaxAmp was zero, so we didn't
try smaller thresholds. Since query points were
chosen randomly, we expected the query selectivity
for every threshold e * MaxAmp to fall between
the two extreme selectivities (the worst case and
the best case) computed analytically for e. As is
shown in Figure 3, for e/MaxAmp < 0.5, using the
symmetry property reduces the query selectivity by
53 to 64 percent and the search time by 70 to 74
percent. This is consistent with our analytical results.
For 0.5 < e/MaxAmp < 1, as the figure shows,
using the symmetry property reduces the query
selectivity by 45 to 64 percent and the running time
by 62 to 74 percent.

Figure 3: Both query selectivities and running times for range queries varying the query threshold

4.2 Varying the number of DFT coefficients 

Our next experiment was again on stock prices 
data, but this time we fixed the query threshold for 
range queries to 0.95 * MaxAmp and that for all- 
pair queries to 0.32*MaxAmp. This setting gave us 
average output sizes of 30 and 203 respectively for 
range and all-pair queries. We varied the number 
of DFT coefficients kept in the index from 1 to 
4. Figure 4 shows the running times per query for
range and all-pair queries. Taking our observations 
into account reduces the search time of the index 
by 66 to 72 percent for range queries and by 61 to 
72 percent for all-pair queries. 

4.3 Varying the number of sequences 

In our next experiment, we fixed the number of
DFT coefficients to 2 and the sequence length to
128, but we varied the number of sequences from
100 to 1067. The experiment was conducted on the
stock prices data set. We again fixed the query threshold
for range queries to 0.95 * MaxAmp and that for
all-pair queries to 0.32 * MaxAmp. Figure 5 shows
the running times per query for range and all-pair
queries. Our observation reduces the search time
of the index by 63 to 71 percent for range queries
and by 64 to 72 percent for all-pair queries.

4.4 Varying the length of sequences 

Figure 6: Running times for range queries varying
the length of sequences (panel title: Range Queries;
horizontal axis: sequence length)

Our last experiment was on synthetic data where
we fixed the number of DFT coefficients to 2 and
the number of sequences to 20,000, but we varied
the sequence length from 128 to 512. The size of
the data file was in the range of 40 Mbytes (for
sequences of length 128) to 160 Mbytes (for sequences
of length 512). We fixed the query threshold to
0.44 * MaxAmp and, based on our analytical
results, we expected using the symmetry property to
reduce the search time by 50 to 75 percent.
Figure 6 shows the running times per query for range
queries. Our proposed method reduces the search
time of the index by 73 to 77 percent. The search
time improvement is slightly more than our
analytical estimates mainly because of the CPU time
reduction for distance computations which is not
accounted for in our analytical estimates. Because
of the high volume of data, experiments on all-pair
queries were very time consuming. For example,
doing a self-join on sequences of length 512 did not
finish after 12 hours of overnight running. For this
reason, we did not report them.

Figure 4: Running times for range and all-pair queries varying the number of DFT coefficients

Figure 5: Running times for range and all-pair queries varying the number of sequences

5 Conclusions 

We have proposed using the last few Fourier co- 
efficients of time sequences in the distance com- 
putation, the main observation being that every 
coefficient at the end is the complex conjugate of 
a coefficient at the beginning and as strong as its 
counterpart. Our analytical observation shows that 
using the last few Fourier coefficients in the dis- 
tance computation accelerates the search time of 
the index by more than a factor of two for a large 
range of thresholds. We also evaluated our pro- 
posed method over real and synthetic data. Our 
experimental results were consistent with our ana- 
lytical observation; in all our experiments the pro- 
posed method reduced the search time of the index 
by 61 to 77 percent for both range and all-pair 
queries. 